Abstract

We consider the problem of Bayesian learning on sensitive datasets and present two 
simple but somewhat surprising results that connect Bayesian learning to “differential 
privacy”, a cryptographic approach to protect individual-level privacy while permitting
database-level utility. Specifically, we show that under standard assumptions,
getting one single sample from a posterior distribution is differentially private “for free”. 

We will see that this estimator is statistically consistent, near optimal and computationally
tractable whenever the Bayesian model of interest is consistent, optimal and tractable.
Similarly but separately, we show that a recent line of work that uses stochastic gradients
for Hybrid Monte Carlo (HMC) sampling also preserves differential privacy with minor
or no modifications of the algorithmic procedure. These observations lead to an
“anytime” algorithm for Bayesian learning under privacy constraints. We demonstrate
that it performs much better than the state-of-the-art differentially private methods on
synthetic and real datasets.
1 Introduction


Bayesian models have proven to be one of the most successful classes of tools in machine
learning. They stand out as a principled yet conceptually simple pipeline for combining expert
knowledge and statistical evidence, modelling complicated dependency structures and
harnessing uncertainty by making probabilistic inferences (Geman & Geman, 1984; Gelman
et al., 2014). In the past few decades, the Bayesian approach has been intensively used
in modelling speech (Rabiner, 1989), text documents (Blei et al., 2003), images/videos
(Fei-Fei & Perona, 2005), social networks (Airoldi et al., 2009), brain activity (Penny et al.,
2011), and is often considered the gold standard in many of these application domains. Learning
a Bayesian model typically involves sampling from a posterior distribution, therefore the
learning process is inherently randomized.

Differential privacy (DP) is a cryptography-inspired notion of privacy (Dwork, 2006; Dwork
et al., 2006). It is designed to provide a very strong form of protection of an individual user’s
private information and at the same time allow data analyses to be conducted with proper
utility. Any algorithm that preserves differential privacy must be appropriately randomized
too. For instance, one can release the average salary of Californian males in a differentially
private way by adding Laplace noise proportional to the sensitivity of this figure to small
perturbations of the data sample.
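The salary example can be sketched in a few lines. This is a minimal illustration rather than a method from the paper: the function name `private_mean`, the clipping bounds and the synthetic salaries are all assumptions, and the number of records is treated as public so that neighboring data sets differ by replacing one record.

```python
import numpy as np

def private_mean(values, lower, upper, eps, rng):
    """Release the mean of `values` with eps-DP via the Laplace mechanism."""
    # Clipping to [lower, upper] bounds the influence of any single record.
    clipped = np.clip(np.asarray(values, dtype=float), lower, upper)
    n = clipped.size
    # Replacing one record moves the clipped mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    # Laplace noise with scale sensitivity / eps gives eps-differential privacy.
    noise = rng.laplace(loc=0.0, scale=sensitivity / eps)
    return clipped.mean() + noise

rng = np.random.default_rng(0)
salaries = rng.uniform(30_000, 150_000, size=10_000)  # synthetic, hypothetical data
released = private_mean(salaries, 0.0, 200_000.0, eps=0.5, rng=rng)
```

With 10,000 records and these bounds the noise scale is 40, so the released figure stays close to the true mean while hiding any single salary.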

In this paper, we connect the two seemingly unrelated concepts by showing that under 
standard assumptions, the intrinsic randomization in the Bayesian learning can be exploited 
to obtain a degree of differential privacy. In particular, we show that: 

• Any algorithm that produces a single sample from the exact (or approximate) posterior
distribution of a Bayesian model with bounded log-likelihood is $\epsilon$ (or $(\epsilon, \delta)$)-differentially
private¹. By the classic results in asymptotic statistics (Le Cam, 1986;
Van der Vaart, 2000), we show that this posterior sample is a consistent estimator
whenever the Bayesian model is consistent; and near optimal whenever the asymptotic
normality and efficiency of the maximum likelihood estimate holds.

• The popular large-scale sampler Stochastic Gradient Langevin Dynamics (Welling
& Teh, 2011) and its extensions, e.g. Ahn et al. (2012); Chen et al. (2014); Ding et al.
(2014), are $(\epsilon, \delta)$-differentially private with no algorithmic changes when the stepsize
is chosen to be small. This gives us a procedure that can potentially output many
(correlated) samples from an approximate posterior distribution.

These simple yet interesting findings make it possible for differential privacy to be explicitly 
considered when designing Bayesian models, and for Bayesian posterior sampling to be used 
as a valid DP mechanism. We demonstrate empirically that these methods work as well as
or better than the state-of-the-art differentially private empirical risk minimization (ERM)
solvers using objective perturbation (Chaudhuri et al., 2011; Kifer et al., 2012).

¹Similar observations were made in Mir (2013) and Dimitrakakis et al. (2014) under slightly different
regimes and assumptions, and we will review them among other related work in Section 6.

The results presented in this paper are closely related to a number of previous works, e.g.,
McSherry & Talwar (2007); Mir (2013); Bassily et al. (2014); Dimitrakakis et al. (2014).
Proper comparisons with them would require knowledge of our results, thus we
defer detailed comparisons to Section 6 near the end of the paper.


2 Notation and Preliminaries

Throughout the paper, we assume data point $x \in \mathcal{X}$ and $\theta \in \Theta$ is the model. This can
be the finite dimensional parameter of a single exponential family model or a collection of
these in a graphical model, or a function in a Hilbert space or other infinite dimensional
objects if the model is nonparametric. $\pi(\theta)$ denotes a prior belief of the model parameters
and $p(x|\theta)$ and $\ell(x|\theta)$ are the likelihood and log-likelihood of observing data point $x$ given
model parameter $\theta$. If we observe $X = \{x_1, ..., x_n\}$, the posterior distribution

$$\pi(\theta|x_1, ..., x_n) = \frac{\prod_{i=1}^n p(x_i|\theta)\,\pi(\theta)}{\int \prod_{i=1}^n p(x_i|\theta)\,\pi(\theta)\,d\theta}$$

denotes the updated belief conditioned on the observed data. Learning a Bayesian model
corresponds to finding the mean or mode of the posterior distribution, but often, the entire
distribution is treated as the output, which provides much richer information than just 
a point estimator. In particular, we get error bars of the estimators for free (credibility 
intervals). 

Ignoring the philosophical disputes of Bayesian methods for the moment, practical challenges 
of Bayesian learning are often computational. As the models get more complicated, often 
there is not a closed-form expression for the posterior. Instead, we often rely on Markov 
Chain Monte Carlo methods, e.g., the Metropolis-Hastings algorithm (Hastings, 1970), to
generate samples. This is often prohibitively expensive when the data is large. One recent
approach to scale up Bayesian learning is to combine stochastic gradient estimation as in
Robbins & Monro (1951) with Monte Carlo methods that simulate stochastic differential
equations, e.g. Neal (2011). These include Stochastic Gradient Langevin Dynamics (SGLD)
(Welling & Teh, 2011), Stochastic Gradient Fisher Scoring (SGFS) (Ahn et al., 2012),
Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) (Chen et al., 2014) as well as the more
recent Stochastic Gradient Nosé-Hoover Thermostat (SGNHT) (Ding et al., 2014). We
will describe them in more detail and show that this series of tools provides differential
privacy as a byproduct of using stochastic gradients and requiring the solution to not collapse
to a point estimate.
2.1 Differential privacy


We now review what we need to know about differential privacy. Let the space of
data be $\mathcal{X}$ and data sets $X, Y \in \mathcal{X}^n$. Define $d(X, Y)$ to be the edit distance or Hamming
distance between data sets $X$ and $Y$; for instance, if $X$ and $Y$ are the same except for one data
point, then $d(X, Y) = 1$.

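Definition 1. (Differential Privacy) We call a randomized algorithm $\mathcal{A}$ $(\epsilon, \delta)$-differentially
private with domain $\mathcal{X}^n$ if for all measurable sets $S \subset \mathrm{Range}(\mathcal{A})$ and for all $X, Y \in \mathcal{X}^n$ such
that $d(X, Y) \leq 1$, we have

$$P(\mathcal{A}(X) \in S) \leq \exp(\epsilon)\, P(\mathcal{A}(Y) \in S) + \delta.$$

If $\delta = 0$, then $\mathcal{A}$ is called $\epsilon$-differentially private.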

This definition naturally prevents linkage attacks and the identification of individual 
data from adversaries having arbitrary side information and infinite computational power. 
The promise of differential privacy has been interpreted in statistical testing, Bayesian
inference and information theory, for which we refer readers to Chapter 1 of Dwork & Roth
(2013).

There are several interesting properties of differential privacy that we will exploit here. 
Firstly, the definition is closed under postprocessing. 

Lemma 2 (Postprocessing immunity). If $\mathcal{A}$ is an $(\epsilon, \delta)$-DP algorithm, $\mathcal{B} \circ \mathcal{A}$ is also an $(\epsilon, \delta)$-DP
algorithm for any $\mathcal{B}$.

This is natural because otherwise the whole point of differential privacy would be forfeited.
Also, the definition automatically allows for cases when the sensitive data are accessed more
than once.

Lemma 3 (Composition rule). If algorithm $\mathcal{A}_1$ is $(\epsilon_1, \delta_1)$-DP, and $\mathcal{A}_2$ is $(\epsilon_2, \delta_2)$-DP then
$(\mathcal{A}_1 \otimes \mathcal{A}_2)$ is $(\epsilon_1 + \epsilon_2, \delta_1 + \delta_2)$-DP.

We will describe more advanced properties of DP as we need in Section 4. 


3 Posterior sampling and differential privacy 

In this section, we make a simple observation that under a boundedness condition on the
log-likelihood, getting one single sample from the posterior distribution (denoted by “OPS
mechanism” from here onwards) preserves a degree of differential privacy for free. Then we
will cite classic results in statistics and show that this sample is a consistent estimator in a
Frequentist sense and near-optimal in many cases.
3.1 Implicitly Preserving Differential Privacy


To begin with, we show that sampling from the posterior distribution is intrinsically 
differentially private. 

Theorem 4. If $\sup_{x \in \mathcal{X}, \theta \in \Theta} |\log p(x|\theta)| \leq B$, releasing one sample from the posterior
distribution $p(\theta|X^n)$ with any prior preserves $4B$-differential privacy. Alternatively, if $\mathcal{X}$ is
a bounded domain (e.g., $\|x\|_* \leq R \;\; \forall x \in \mathcal{X}$) and $\log p(x|\theta)$ is an $L$-Lipschitz function in
$\|\cdot\|_*$ for any $\theta \in \Theta$, then releasing one sample from the posterior distribution preserves
$4LR$-differential privacy.


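Proof. The posterior distribution is $p(\theta|x_1, ..., x_n) = \frac{\prod_{i=1}^n p(x_i|\theta)\, p(\theta)}{\int_\theta \prod_{i=1}^n p(x_i|\theta)\, p(\theta)\, d\theta}$. For any $x_1, ..., x_n, x_k'$,
the ratio can be factorized into

$$\frac{p(\theta|x_1, ..., x_k', ..., x_n)}{p(\theta|x_1, ..., x_k, ..., x_n)} = \underbrace{\frac{\prod_{i=1:n, i \neq k} p(x_i|\theta)\, p(x_k'|\theta)\, p(\theta)}{\prod_{i=1}^n p(x_i|\theta)\, p(\theta)}}_{\text{Factor 1}} \times \underbrace{\frac{\int_\theta \prod_{i=1}^n p(x_i|\theta)\, p(\theta)\, d\theta}{\int_\theta \prod_{i=1:n, i \neq k} p(x_i|\theta)\, p(x_k'|\theta)\, p(\theta)\, d\theta}}_{\text{Factor 2}}.$$

It follows that

$$\text{Factor 1} = \frac{p(x_k'|\theta)}{p(x_k|\theta)} = e^{\log p(x_k'|\theta) - \log p(x_k|\theta)} \leq e^{2B},$$

$$\text{Factor 2} = \frac{\int_\theta \prod_{i \neq k} p(x_i|\theta)\, p(\theta)\, p(x_k|\theta)\, d\theta}{\int_\theta p(x_k'|\theta) \prod_{i \neq k} p(x_i|\theta)\, p(\theta)\, d\theta} = \frac{\int_\theta \prod_{i \neq k} p(x_i|\theta)\, p(\theta)\, p(x_k'|\theta)\, \frac{p(x_k|\theta)}{p(x_k'|\theta)}\, d\theta}{\int_\theta p(x_k'|\theta) \prod_{i \neq k} p(x_i|\theta)\, p(\theta)\, d\theta} \leq e^{2B}\, \frac{\int_\theta p(x_k'|\theta) \prod_{i \neq k} p(x_i|\theta)\, p(\theta)\, d\theta}{m(x_1, ..., x_k', ..., x_n)} = e^{2B},$$

where we use $m(X)$ to denote the marginal distribution. As a result, the whole thing is
bounded by $e^{4B}$.

Alternatively, we can use the Lipschitz constant and boundedness to get $\log p(x_k'|\theta) -
\log p(x_k|\theta) \leq L\|x_k' - x_k\|_* \leq 2LR$. ■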


Readers familiar with differential privacy may have noticed that this is actually an instance
of the exponential mechanism (McSherry & Talwar, 2007), a general procedure that preserves
privacy while making outputs with higher utility exponentially more likely. If one sets
the utility function to be the log-likelihood and the privacy parameter to be $4B$, then we
get exactly the one-posterior sample mechanism. This exponential mechanism point of
view provides a simple extension which allows us to specify $\epsilon$ by simply scaling the
log-likelihood (see Algorithm 1). We will overload the notation OPS to also represent this
mechanism where we can specify $\epsilon$. The nice thing about this algorithm is that there is
almost zero implementation effort to extend all posterior sampling-based Bayesian learning
models to have differential privacy of any specified $\epsilon$.

Algorithm 1 One-Posterior Sample (OPS) estimator

input Data $X$, log-likelihood function $\ell(\cdot|\cdot)$ satisfying $\sup_{x, \theta} |\ell(x|\theta)| \leq B$, a prior $\pi(\cdot)$.
Privacy loss $\epsilon$.

1. Set $\rho = \min\{1, \frac{\epsilon}{4B}\}$.

2. Re-define the log-likelihood function and the prior: $\ell'(\cdot|\cdot) := \rho\,\ell(\cdot|\cdot)$ and $\pi'(\cdot) := (\pi(\cdot))^{\rho}$.

output $\hat\theta \sim P(\theta|X) \propto \exp\left(\sum_{i=1}^n \ell'(x_i|\theta)\right) \pi'(\theta)$.
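Algorithm 1 can be sketched for a toy model in which exact posterior sampling is trivial. The sketch below is illustrative only and rests on several assumptions that are not in the paper: the parameter space is taken to be a finite grid, the model is Bernoulli with $\theta \in [0.1, 0.9]$ (so that $B = \log 10$), the prior is flat, and the names `ops_sample`, `bernoulli_log_lik` and `flat_log_prior` are hypothetical.

```python
import numpy as np

def ops_sample(x, theta_grid, log_lik, log_prior, B, eps, rng):
    """Draw one sample from the rho-scaled posterior over a finite parameter grid."""
    # Step 1: scaling factor that turns the 4B guarantee into an eps guarantee.
    rho = min(1.0, eps / (4.0 * B))
    # Step 2: scale both the log-likelihood and the log-prior by rho.
    log_post = rho * (log_lik(x, theta_grid).sum(axis=0) + log_prior(theta_grid))
    # Normalise in a numerically stable way, then sample exactly from the grid.
    log_post = log_post - log_post.max()
    probs = np.exp(log_post)
    probs = probs / probs.sum()
    return rng.choice(theta_grid, p=probs), rho

def bernoulli_log_lik(x, theta):
    """Log-likelihood matrix of shape (len(x), len(theta)) for a Bernoulli model."""
    x = np.asarray(x, dtype=float)[:, None]
    theta = np.asarray(theta, dtype=float)[None, :]
    return x * np.log(theta) + (1.0 - x) * np.log1p(-theta)

def flat_log_prior(theta):
    return np.zeros_like(np.asarray(theta, dtype=float))

rng = np.random.default_rng(0)
theta_grid = np.linspace(0.1, 0.9, 81)   # assumed parameter space
B = np.log(10.0)                         # |log p(x|theta)| <= log(10) on this grid
x = rng.binomial(1, 0.3, size=500)       # synthetic data
theta_hat, rho = ops_sample(x, theta_grid, bernoulli_log_lik, flat_log_prior,
                            B, eps=1.0, rng=rng)
```

For models without a tractable posterior, the grid step would be replaced by whatever sampler the model already uses, run on the rescaled log-likelihood and prior.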


Assumption on the boundedness. The boundedness of the loss function (log-likelihood
here) is a standard assumption in many DP works (Chaudhuri et al., 2011; Bassily et al.,
2014; Song et al., 2013; Kifer et al., 2012). The Lipschitz constant $L$ is usually small for continuous
distributions (at least when the parameter space $\Theta$ is bounded). This is a bound on
$\log p(x|\theta)$, so as long as $p(x|\theta)$ does not increase or decrease super-exponentially fast at any
point, $L$ will be a small constant. $R$ can also be made small by a simple preprocessing step
that scales down all data points. In the aforementioned papers that assume $L$, it is typical
that they also assume $R = 1$ for convenience, so we will do the same. In practice, we can
algorithmically remove large data points from the data by some predefined threshold or
using the “Propose-Test-Release” framework in (Dwork & Lei, 2009), or perform weighted
training where we assign lower weight to data points with large magnitude. Note that
this is a desirable step for robustness to outliers too. Exponential families (in Hilbert
space) are an example, see e.g. Bialek et al. (2001); Hofmann et al. (2008); Wainwright &
Jordan (2008).

3.2 Consistency and Near-Optimality 

Now we move on to study the consistency of the OPS estimator. In great generality, we will
show that the one-posterior sample estimator is consistent whenever the Bayesian model is
posterior consistent. Since consistency in Bayesian methods can have different meanings,
we briefly describe two of them according to the nomenclature in Orbanz (2012).

Definition 5 (Posterior consistency in the Bayesian Sense). For a prior $\pi$, we say the
model is posterior consistent in the Bayesian sense, if $\theta \sim \pi(\theta)$, $x_1, ..., x_n \sim p_\theta$, and the
posterior

$$\pi(\theta|x_1, ..., x_n) \xrightarrow{\text{weakly}} \delta_\theta \quad \text{a.s. } \pi.$$

$\delta_\theta$ is the Dirac-delta function at $\theta$.
In great generality, Doob’s well-known theorem guarantees posterior consistency in the
Bayesian sense for a model with any prior under no conditions except identifiability and
measurability. A concise statement of Doob’s result can be found in Van der Vaart (2000,
Theorem 10.10).

An arguably more reasonable definition is given below. It applies to the case when the
statistician who chooses the prior $\pi$ does not know about the true parameter.

Definition 6 (Posterior consistency in the Frequentist Sense). For a prior $\pi$, we say the
model is posterior consistent in the Frequentist sense, if for every $\theta_0 \in \Theta$, $x_1, ..., x_n \sim p_{\theta_0}$,
the posterior

$$\pi(\theta|x_1, ..., x_n) \xrightarrow{\text{weakly}} \delta_{\theta_0} \quad \text{a.s. } p_{\theta_0}.$$

This type of consistency is much harder to satisfy especially when $\Theta$ is an infinite dimensional
space, in which case the consistency often depends on the specific priors used. A promising
series of results on the consistency of Bayesian nonparametric models can be found in
Ghosal (2010).

Regardless of which definition one favors, the key notion of consistency is that the posterior
distribution concentrates around the true underlying $\theta$ that generates the data.

Proposition 7. The one-posterior sample estimator is consistent if and only if the Bayesian
model is posterior consistent (in the sense of either Definition 5 or 6).

Proof. The equivalence follows from the standard equivalence of convergence weakly and 
convergence in probability when a random variable converges weakly to a point mass. ■ 

How about the rate of convergence? In the low dimensional setting when $\theta \in \Theta \subset \mathbb{R}^d$,
$p_\theta(x)$ is suitably differentiable and the prior is supported in a neighborhood of the true
parameter, the Bernstein-von Mises theorem (Le Cam, 1986) states that the posterior mean is
an asymptotically efficient estimator and the posterior distribution converges in $L_1$-distance
to a normal distribution with covariance being the inverse Fisher Information.

Proposition 8. Under the regularity conditions where the Bernstein-von Mises theorem holds,
the One-Posterior sample $\hat\theta \sim \pi(\theta|x_1, ..., x_n)$ obeys

$$\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{\text{weakly}} \mathcal{N}(0, 2I^{-1}),$$

i.e., the One-Posterior sample estimator has an asymptotic relative efficiency of 2.

Proof. Let the One-Posterior sample be $\hat\theta \sim \pi(\theta|x_1, ..., x_n)$ and let $\tilde\theta$ denote the posterior
mean. By the Bernstein-von Mises theorem, $\sqrt{n}(\hat\theta - \tilde\theta) \xrightarrow{\text{weakly}} \mathcal{N}(0, I^{-1})$. By the asymptotic
normality and efficiency of the posterior mean estimator, $\sqrt{n}(\tilde\theta - \theta_0) \xrightarrow{\text{weakly}} \mathcal{N}(0, I^{-1})$.
The proof is complete by taking the sum of the two asymptotically independent Gaussian
vectors ($\tilde\theta$ and $\hat\theta - \tilde\theta$ are asymptotically independent). ■


The above proposition suggests that in many interesting classes of parametric Bayesian 
models, the One-Posterior Sample estimator is asymptotically near optimal. Similar 
statements can also be obtained for some classes of semi-parametric and nonparametric 
Bayesian models (Ghosal, 2010), which we leave as future work. 

The drawback of the above two propositions is that they are only stated for the version of
the OPS when $\epsilon = 4B$. Using results in De Blasi & Walker (2013) and Kleijn et al. (2012) for
Bayesian learning under misspecified models, we can prove consistency and asymptotic normality
for any $\epsilon$ and parameterize the asymptotic relative efficiency of the OPS estimator as a
function of $\epsilon$. The key idea is that when we scale the log-likelihood and sample from a different
distribution, we are essentially fitting a model that may not include the data-generating
true distribution. De Blasi & Walker (2013) show that under mild conditions, when the
model is misspecified, the posterior distribution will converge to a point mass at $\theta^*$ that
minimizes the KL-divergence between the true distribution and the corresponding
distribution in the misspecified model. $\theta^*$ is essentially the MLE and in our case, since we
only scaled the distribution, the MLE will remain exactly the same. De Blasi & Walker
(2013)’s result is quite general and covers both parametric and nonparametric Bayesian
models, and whenever their assumptions hold, the OPS estimator is consistent. Using a
similar argument and the modified Bernstein-von Mises theorem in Kleijn et al. (2012),
we can prove asymptotic normality and near optimality for the subset of problems where
the regularity conditions of the MLE hold.

Proposition 9. Under the same assumptions as Proposition 8, if we set a different $\epsilon$ by
rescaling the log-likelihood by a factor of $\rho$, then the One-Posterior sample estimator
obeys

$$\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{\text{weakly}} \mathcal{N}\left(0, (1 + \rho^{-1}) I^{-1}\right),$$

in other words, the estimator has an ARE of $(1 + \rho^{-1})$.

Proof. By scaling the log-likelihood, we are essentially changing the correct model $p_\theta$
to a misspecified model $(p_\theta)^{\rho}$. Let the true log-likelihood be $\ell$ and the misspecified
log-likelihood be $\tilde\ell = \rho\ell$; in addition, define

$$V(\theta) := \mathbb{E}_\theta \nabla\tilde\ell(\theta) \nabla\tilde\ell(\theta)^T = \rho^2\, \mathbb{E}_\theta \nabla\ell(\theta) \nabla\ell(\theta)^T = \rho^2 I,$$
$$J(\theta) := -\mathbb{E}_\theta \nabla^2 \tilde\ell(\theta) = -\rho\, \mathbb{E}_\theta \nabla^2 \ell(\theta) = \rho I.$$

The last equality holds under the standard regularity conditions. By the sandwich formula,
the maximum likelihood estimator $\tilde\theta$ under the misspecified model is asymptotically normal:

$$\sqrt{n}(\tilde\theta - \theta^*) \xrightarrow{\text{weakly}} \mathcal{N}(0, J^{-1} V J^{-1}) = \mathcal{N}(0, I^{-1}),$$

where $\theta^*$ defines the closest (in terms of KL-divergence) model in the misspecified class of
distributions to the true distribution that generates the data. Since the difference is only in
scaling, the minimum KL-divergence is obtained at $\theta^* = \theta_0$. Now under the same regularity
conditions, we can invoke the modified Bernstein-von Mises theorem for misspecified models
(Kleijn et al., 2012, Lemma 2.2), which says that the posterior distribution $p(\theta|X^n)$ (of the
misspecified model) converges in distribution to $\mathcal{N}(\tilde\theta, (nJ)^{-1})$. In our case, $(nJ)^{-1} = \rho^{-1} I^{-1} / n$.

The proof is concluded by noting that the posterior sample is an independent draw. ■
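The variance inflation in Propositions 8 and 9 can be checked numerically on a model with a closed-form posterior. The following is a minimal sketch under assumptions that are not in the paper: a one-dimensional Gaussian location model with unit variance and a flat prior, so that $I = 1$ and the $\rho$-scaled posterior is $\mathcal{N}(\bar{x}, 1/(\rho n))$. This model does not have a bounded log-likelihood, so it illustrates only the efficiency statement, not the privacy guarantee; the function name `scaled_ops_draws` is hypothetical.

```python
import numpy as np

def scaled_ops_draws(theta0, n, rho, reps, rng):
    """Simulate sqrt(n) * (theta_hat - theta0) for the rho-scaled posterior sample."""
    # The sample mean is sufficient, so simulate it directly: xbar ~ N(theta0, 1/n).
    xbar = theta0 + rng.standard_normal(reps) / np.sqrt(n)
    # Scaled posterior under a flat prior: N(xbar, 1 / (rho * n)).
    theta_hat = xbar + rng.standard_normal(reps) / np.sqrt(rho * n)
    return np.sqrt(n) * (theta_hat - theta0)

rng = np.random.default_rng(0)
for rho in (1.0, 0.25):
    z = scaled_ops_draws(theta0=0.7, n=1000, rho=rho, reps=200_000, rng=rng)
    # Predicted variance is (1 + 1/rho) * I^{-1} with I = 1 in this model.
    print(f"rho={rho}: empirical variance {z.var():.3f}, predicted {1 + 1 / rho:.3f}")
```

Setting `rho = 1.0` corresponds to Proposition 8 (a factor of 2), while smaller values of `rho` show the $(1 + \rho^{-1})$ factor of Proposition 9.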

We make a few interesting remarks about the result. 

1. Proposition 9 suggests that for models with bounded log-likelihood, OPS is only a
factor of $(1 + 4B/\epsilon)$ away from being optimal. This is in sharp contrast to most previous
statistical analyses of DP methods that are only tight up to a numerical constant

(and often a logarithmic term). In .^ 2 -norm, the convergence rate is 0{ —^)- 

The bound depends on the dimension through the Frobenius norm which is usually
$O(\sqrt{d})$. The bound can be further sharpened using assumptions on the intrinsic rank,
incoherence conditions or the rate of decay of the eigenvalues of the Fisher information.

In .^oo-norm, the convergence rate is which does not depend on the 

dimension of the problem. 

2. Another implication is on statistical inference. Proposition 9 essentially implies
that classic results in hypothesis testing and confidence intervals, e.g., the Wald test and the
generalized likelihood ratio test, can be directly adopted for private learning
problems, with an appropriate calibration using $\epsilon$. We can control the type I error
in an asymptotically exact fashion. In addition, the trade-off between $\epsilon$ and the test
power is also explicitly described, so in cases where the power of the tests is well-studied
(Lehmann & Romano, 2006), the same handle can be used to analyze the
most-powerful-test under privacy constraints.

3. Lastly, the results in De Blasi & Walker (2013) and Kleijn et al. (2012) are much
more general. It is easy to extend the guarantee for OPS to handle private Bayesian
learning in a fully agnostic setting and in non-iid cases. We will leave the formalization
of these claims as future directions.

3.3 (Efficient) sampling from approximate posterior 

The privacy guarantee in Theorem 4 requires sampling from the exact posterior. In practice,
however, exact samplers are rare. As Bayesian models get more and more complicated,
often the only viable option is to use Markov Chain Monte Carlo (MCMC) samplers which
are almost never exact. There are exceptions, e.g., Propp & Wilson (1998), but they only
apply to problems with very special structures. A natural question to ask is whether we can
still say something meaningful about privacy when the posterior sampling is approximate.
It turns out that we can, and the level of approximation in privacy is the same as the level
of approximation in the sampling distribution.

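Proposition 10. If $\mathcal{A}$ that samples from distribution $P_X$ preserves $\epsilon$-differential privacy,
then any approximate sampling procedure $\mathcal{A}'$ that produces a sample from $P'_X$ such that
$\|P_X - P'_X\|_{L_1} \leq \delta$ for any $X$ preserves $(\epsilon, (1 + e^{\epsilon})\delta)$-differential privacy.

Proof. For any $S \subset \mathrm{Range}(\mathcal{A}')$, and $d(X, X') \leq 1$,

$$P(\mathcal{A}'(X) \in S) = \int_S dP'_X \leq \int_S dP_X + \delta$$
$$\leq e^{\epsilon} \int_S dP_{X'} + \delta \leq e^{\epsilon} \left( \int_S dP'_{X'} + \delta \right) + \delta$$
$$= e^{\epsilon} \int_S dP'_{X'} + (1 + e^{\epsilon})\delta$$
$$= e^{\epsilon} P(\mathcal{A}'(X') \in S) + (1 + e^{\epsilon})\delta.$$

This is $(\epsilon, (1 + e^{\epsilon})\delta)$-DP by definition. ■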

We are using the $L_1$ distance between distributions because it is a commonly accepted metric
for measuring the convergence rate of MCMC (Rosenthal, 1995), and Proposition 10 leaves a
clean interface for computational analysis in determining the number of iterations needed
to attain a specific level of privacy protection.


A note on computational efficiency. The (unsurprising) bad news is that even approximate
sampling from the posterior is NP-Hard in general, see, e.g. Sontag & Roy
(2011, Theorem 8). There are however interesting results on when we can (approximately)
sample efficiently. Approximation is easy for sampling LDA when $\alpha > 1$ while NP-Hard
when $\alpha < 1$. A more general result in Applegate & Kannan (1991) suggests that we can
get a sample with arbitrarily close approximation in polynomial time for a class of near
log-concave distributions. The log-concavity of the distributions would imply convexity of
the negative log-likelihood, thus, this essentially confirms the computational efficiency of all convex
empirical risk minimization problems under differential privacy constraints (see Bassily et al.
(2014)).

The nice thing is that since we do not modify the form of the sampling algorithm at all, 
the OPS algorithm is going to be a computationally tractable DP method whenever the 
Bayesian learning model of interest is proven to be computationally tractable. 

This observation provides an interesting insight into the problem of computational lower
bounds for differentially private machine learning. Unlike what is conjectured in Dwork et al.
(2014), our observation seems to suggest that the computational barrier is not specific to
differential privacy, but rather the barrier of learning in general. The argument seems to
hold at least for some class of problems, where the posterior sample achieves the optimal
statistical rate and is at least $4B$-DP.


3.4 Discussions and comparisons 

OPS has a number of advantages over the state-of-the-art differentially private ERM
method: objective perturbation (Chaudhuri et al., 2011; Kifer et al., 2012) (ObjPert
from here onwards). OPS works with arbitrary bounded loss functions and priors while
ObjPert needs a number of restrictive assumptions including twice differentiable loss
functions, a strong convexity parameter greater than a threshold and so on. These
restrictions rule out many commonly used loss functions, e.g., the $\ell_1$-loss, the hinge loss and the Huber
function, just to name a few.

Also, ObjPert’s privacy guarantee holds only for the exact optimal solution, which is
often hard to get in practice. In contrast, OPS works when the sample is drawn from an
approximate posterior distribution. From a practical point of view, since OPS stems from
the intrinsic privacy protection of Bayesian learning, it requires very little implementation
effort to deploy it for practical applications. ObjPert also requires the problem to be strongly
convex with a minimum strong convexity parameter. When the condition is not satisfied,
ObjPert will need to add additional quadratic regularization to make it so, which may
bias the problem unnecessarily.


4 Stochastic Gradient MCMC and $(\epsilon, \delta)$-Differential privacy

Given a fixed privacy budget, we see that the single posterior sample produces an optimal 
point estimate, but what if we want multiple samples? Can we use the privacy budget in a 
different way that produces many approximate posterior samples? 

In this section we will provide an answer by looking at a class of Stochastic Gradient
MCMC techniques developed over the past few years. We will show that they are also
differentially private for free if the parameters are chosen appropriately.

The idea is to simply privately release an estimate of the gradient (as in Song et al.
(2013); Bassily et al. (2014)) and leverage the following two celebrated lemmas in
differential privacy in the same way as Bassily et al. (2014) do in deriving the near-optimal
$(\epsilon, \delta)$-differentially private SGD.

The first lemma is advanced composition, which allows us to trade off a small amount of
$\delta$ to get a much better bound for the privacy loss due to composition.
Lemma 11 (Advanced composition, c.f. Theorem 3.20 in (Dwork & Roth, 2013)). For all
$\epsilon, \delta, \delta' \geq 0$, the class of $(\epsilon, \delta)$-DP mechanisms satisfies $(\epsilon', k\delta + \delta')$-DP under $k$-fold adaptive
composition for:

$$\epsilon' = \sqrt{2k \log(1/\delta')}\,\epsilon + k\epsilon(e^{\epsilon} - 1).$$

Remark 12. When $\epsilon = \frac{c}{\sqrt{2k \log(1/\delta')}} < 1$ for some constant $c < \sqrt{\log(1/\delta')}$, we can simplify
the above expression into $\epsilon' \leq 2c$. To see this, apply the inequality $e^{\epsilon} - 1 \leq 2\epsilon$ (easily shown
via Taylor’s theorem and the assumption that $\epsilon < 1$).
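The bookkeeping in Lemma 11 and Remark 12 is easy to mechanize. The sketch below is illustrative, with hypothetical function names and example values of $k$, $\delta'$ and $c$; it evaluates the advanced composition bound for the per-step budget $\epsilon = c / \sqrt{2k \log(1/\delta')}$ and compares it with the linear growth of the basic composition rule in Lemma 3.

```python
import math

def advanced_composition(eps, delta, k, delta_prime):
    """Total (eps', k*delta + delta') after k-fold adaptive composition (Lemma 11)."""
    eps_total = (math.sqrt(2.0 * k * math.log(1.0 / delta_prime)) * eps
                 + k * eps * math.expm1(eps))
    return eps_total, k * delta + delta_prime

def per_step_eps(c, k, delta_prime):
    """Per-step budget from Remark 12: eps = c / sqrt(2 k log(1/delta'))."""
    return c / math.sqrt(2.0 * k * math.log(1.0 / delta_prime))

# Hypothetical values: 10,000 adaptive steps, each pure eps-DP (delta = 0).
k, delta_prime, c = 10_000, 1e-6, 0.5
eps_step = per_step_eps(c, k, delta_prime)
eps_total, delta_total = advanced_composition(eps_step, 0.0, k, delta_prime)
eps_basic = k * eps_step  # Lemma 3 would charge eps linearly in k
```

Here `eps_total` stays below `2 * c`, as Remark 12 states, whereas `eps_basic` grows with $\sqrt{k}$ for the same per-step budget.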

In addition, we will also make use of the following lemma due to Beimel et al. (2014).

Lemma 13 (Privacy for subsampled data. Lemma 4.4 in Beimel et al. (2014)). Over
a domain of data sets $\mathcal{X}^N$, if an algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private (with $\epsilon < 1$),
then for any data set $X \in \mathcal{X}^N$, running $\mathcal{A}$ on a uniformly random $\gamma N$ entries of $X$ ensures
$(2\gamma\epsilon, \delta)$-DP.

To make sense of the above lemma, notice that we are subsampling uniformly at random and
the probability of any single data point being sampled is only $\gamma$. Thus, if we arbitrarily
perturb one of the data points, its impact is evenly spread across all data points thanks to
random sampling.

Let $f : \mathcal{X}^N \to \mathbb{R}^d$ be an arbitrary $d$-dimensional function. Define the $\ell_2$ sensitivity of $f$ to
be

$$\Delta_2 f = \sup_{X, Y : d(X, Y) \leq 1} \|f(X) - f(Y)\|_2.$$

Suppose we want to output $f(X)$ in a differentially private way. The “Gaussian Mechanism” outputs
$\hat{f}(X) = f(X) + \mathcal{N}(0, \sigma^2 I_d)$ for some appropriate $\sigma$.

Theorem 14 (Gaussian Mechanism, c.f. Dwork & Roth (2013)). Let $\epsilon \in (0, 1)$ be arbitrary.
The “Gaussian Mechanism” with $\sigma \geq \Delta_2 f \sqrt{2 \log(1.25/\delta)} / \epsilon$ is $(\epsilon, \delta)$-differentially private.

This will be the main workhorse that we use here. 
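A minimal sketch of the Gaussian Mechanism with the noise level from Theorem 14 follows. The function name `gaussian_mechanism` and the example (releasing the mean of vectors with $\ell_2$ norm at most 1, whose sensitivity is $2/N$ when one record is replaced) are assumptions made for illustration.

```python
import numpy as np

def gaussian_mechanism(f_value, l2_sensitivity, eps, delta, rng):
    """Add isotropic Gaussian noise calibrated as in Theorem 14."""
    if not 0.0 < eps < 1.0:
        raise ValueError("Theorem 14 is stated for eps in (0, 1)")
    # Smallest sigma allowed by the theorem for this sensitivity.
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    f_value = np.asarray(f_value, dtype=float)
    return f_value + rng.normal(0.0, sigma, size=f_value.shape), sigma

rng = np.random.default_rng(0)
data = rng.uniform(-1.0, 1.0, size=(1000, 3))          # synthetic data
norms = np.linalg.norm(data, axis=1, keepdims=True)
data = data / np.maximum(norms, 1.0)                    # enforce ||x||_2 <= 1
noisy_mean, sigma = gaussian_mechanism(
    data.mean(axis=0), l2_sensitivity=2.0 / len(data), eps=0.5, delta=1e-5, rng=rng
)
```

In the SGLD analysis that follows, the released quantity is a minibatch gradient rather than a mean, but the calibration of $\sigma$ is the same.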

4.1 Stochastic Gradient Langevin Dynamics 

SGLD iteratively updates the parameters by running a perturbed version of minibatch
stochastic gradient descent on the negative log-posterior objective function

$$-\sum_{i=1}^N \log p(x_i|\theta) - \log \pi(\theta) =: \sum_{i=1}^N \ell(x_i; \theta) + r(\theta)$$

where $\ell(x_i; \theta)$ and $r(\theta)$ are the loss function and regularizer under empirical risk minimization.

If one were to run stochastic gradient descent or any other optimization tool on this, one
would eventually obtain a deterministic maximum a posteriori estimator. SGLD avoids this by
adding noise in every iteration. At iteration $t$, SGLD first samples uniformly at random $\tau$ data
points $\{x_{t_1}, ..., x_{t_\tau}\}$ and then updates the parameter using

$$\theta_{t+1} = \theta_t - \eta_t \left( \nabla r(\theta) + \frac{N}{\tau} \sum_{i=1}^{\tau} \nabla \ell(x_{t_i}|\theta) \right) + z_t \qquad (1)$$

where $z_t \sim \mathcal{N}(0, \eta_t)$ and $\tau$ is the mini-batch size.

For ordinary stochastic gradient descent to converge in expectation, the stepsize $\eta_t$
can be chosen as anything that satisfies $\sum_{t=1}^{\infty} \eta_t = \infty$ and $\sum_{t=1}^{\infty} \eta_t^2 < \infty$ (Robbins & Monro, 1951).
Typically, one chooses the stepsize $\eta_t = a(b + t)^{-\gamma}$ with $\gamma \in (0.5, 1]$. In fact, it is shown
that for general convex functions and $\mu$-strongly convex functions, $\frac{1}{\sqrt{t}}$ and $\frac{1}{\mu t}$ can be used
to obtain the minimax optimal $O(1/\sqrt{t})$ and $O(1/t)$ rates of convergence. These results
substantiate the first phase of SGLD: a convergent algorithm to the optimal solution. Once
it gets closer, however, it transforms into a posterior sampler. According to Welling &
Teh (2011), and as later formally proven in Sato & Nakagawa (2014), if we choose $\eta_t \to 0$, the
random iterates $\theta_t$ of SGLD converge in distribution to $p(\theta|X)$. The idea is that as
the stepsize gets smaller, the stochastic error from the true gradient due to the random
sampling of the minibatch converges to 0 faster than the injected Gaussian noise.

In addition, if we use some fixed stepsize lower bound, such that $\eta_t = \max\{1/(t + 1), \eta_0\}$
(to alleviate the slow mixing problem of SGLD), the results correspond to a discretization
approximation of a stochastic differential equation (Fokker-Planck equation), which obeys
the following theorem due to Sato & Nakagawa (2014) (simplified and translated to our
notation).

Theorem 15 (Weak convergence (Sato & Nakagawa, 2014)). Assume $f(\theta|X)$ is differentiable,
$\nabla f(\theta|X)$ is gradient Lipschitz and bounded². Then

$$\left| \mathbb{E}_{\theta \sim p(\theta|X)}[h(\theta)] - \mathbb{E}_{\theta \sim \mathrm{SGLD}}[h(\theta_t)] \right| = O(\eta_t),$$

for any continuous and polynomial growth function $h$.

This theorem implies that one can approximate the posterior mean (and other estimators)
using SGLD. Finite sample properties of SGLD are also studied in (Vollmer et al., 2015).

Now we will show that with a minor modification to just the “burn-in” phase of SGLD, we
will be able to make it differentially private (see Algorithm 2).

Theorem 16 (Differentially private Minibatch SGLD). Assume the initial $\theta_1$ is chosen independent
of the data, and also assume $\ell(x|\theta)$ is $L$-smooth in $\|\cdot\|_2$ for any $x \in \mathcal{X}$ and $\theta \in \Theta$.
In addition, let $\epsilon, \delta, \tau, T$ be chosen such that $T \geq \frac{\epsilon^2 N}{32 \tau \log(2/\delta)}$. Then Algorithm 2 preserves
$(\epsilon, \delta)$-differential privacy.

²We use boundedness to make the presentation simpler. Boundedness trivially implies the linear growth
condition in Sato & Nakagawa (2014, Assumption 2).

Proof. In every iteration, the only data access is $\sum_{i \in S} \nabla \ell(x_i|\theta)$, and by the
L-Lipschitz condition its sensitivity is at most $2L$. We get the essential noise that is
added to $\sum_{i \in S} \nabla \ell(x_i|\theta)$ by removing the $N^2 \eta_t^2 / \tau^2$ factor from the variance
$\sigma^2$ in the algorithm, and the Gaussian mechanism ensures the privacy loss to be smaller than

$$\frac{\epsilon \sqrt{N}}{\sqrt{32 \tau T \log(2/\delta)}}$$

with probability $> 1 - \frac{\tau \delta}{2NT}$.

Using the same technique as in Bassily et al. (2014), we can further exploit the fact that the
subset S that we use to compute the stochastic gradient is chosen uniformly at random. By
Lemma 13, the privacy loss for this iteration is in fact

$$\frac{\epsilon \sqrt{N}}{\sqrt{32 \tau T \log(2/\delta)}} \cdot \frac{2\tau}{N} = \frac{\epsilon/2}{\sqrt{2 (NT/\tau) \log(2/\delta)}}.$$

We can verify that Lemma 13 indeed applies, as $\frac{\epsilon \sqrt{N}}{\sqrt{32 \tau T \log(2/\delta)}} \le 1$ by the
assumption on T. Note that to get T data passes with minibatches of size $\tau$, we need to go
through at most $NT/\tau$ iterations. Applying the advanced composition theorem (Remark 12),
we get an upper bound of the total privacy loss $\epsilon$ and failure probability
$\delta = \delta/2 + \delta/2$ accordingly.

The proof is complete by noting that choosing a larger noise level when $\eta_t$ is bigger can
only reduce the privacy loss under the same failure probability. ■
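
The accounting in the proof can be checked numerically. The following is a minimal sketch rather than part of the original analysis: the function names and the example values of ε, δ, N, T and τ are illustrative assumptions, and `advanced_composition` uses the standard form of the advanced composition bound, which may differ in constants from the statement in Remark 12.

```python
import math


def per_iteration_loss(eps, delta, N, T, tau):
    """Privacy loss of one iteration after subsampling (proof of Theorem 16)."""
    # Loss of the Gaussian mechanism applied to the minibatch gradient sum.
    eps_gauss = eps * math.sqrt(N) / math.sqrt(32 * tau * T * math.log(2 / delta))
    if eps_gauss > 1:
        raise ValueError("assumption on T violated: need T >= eps^2 N / (32 tau log(2/delta))")
    # Sampling tau out of N points uniformly scales the loss by 2 * tau / N.
    return eps_gauss * 2 * tau / N


def advanced_composition(eps0, k, delta_slack):
    """Total loss of k adaptive eps0-private steps (standard bound, assumed form)."""
    return eps0 * math.sqrt(2 * k * math.log(1 / delta_slack)) + k * eps0 * math.expm1(eps0)


# Hypothetical problem sizes, chosen only for illustration.
eps, delta, N, T, tau = 1.0, 1e-5, 10_000, 100, 100
k = N * T // tau                                  # number of iterations
eps0 = per_iteration_loss(eps, delta, N, T, tau)
total = advanced_composition(eps0, k, delta / 2)  # composition spends delta / 2
print(f"per-iteration loss {eps0:.6f}, composed loss {total:.4f} (budget {eps})")
```

The leading term of the composed loss is exactly ε/2 by construction; the remaining second-order term is what the other half of the budget absorbs.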


α-Phase transition. For any $\alpha \in (0,1)$, if we choose
$\eta_t = \frac{\alpha \epsilon^2}{128 L^2 \log(2.5NT/(\tau\delta)) \log(2/\delta) \, t}$,
then whenever $t > \alpha NT/\tau$ we are essentially running SGLD for the last $(1-\alpha)NT/\tau$
iterations, and we can collect approximate posterior samples from there.


Algorithm 2 Differentially Private Stochastic Gradient Langevin Dynamics (DP-SGLD)
Require: Data X of size N, size of minibatch $\tau$, number of data passes T, privacy
parameters $\epsilon, \delta$, Lipschitz constant L and initial $\theta_1$. Set t = 1.
for $t = 1 : \lfloor NT/\tau \rfloor$ do

1. Random sample a minibatch $S \subset [N]$ of size $\tau$.

2. Sample each coordinate of $Z_t$ iid from $\mathcal{N}\left(0, \frac{128 N T L^2}{\tau \epsilon^2} \log\left(\frac{2.5NT}{\tau\delta}\right) \log(2/\delta) \, \eta_t^2 \vee \eta_t\right)$.

3. Update $\theta_{t+1} \leftarrow \theta_t - \eta_t \left(\nabla r(\theta) + \frac{N}{\tau} \sum_{i \in S} \nabla \ell(x_i|\theta)\right) + Z_t$.

4. Return $\theta_{t+1}$ as a posterior sample (after a pre-defined burn-in period).

5. Increment $t \leftarrow t + 1$.

end for

Small constant $\eta_0$. Instead of making $\eta_t$ converge to 0 as t increases, we may
alternatively use a constant $\eta_0$ after t is larger than a threshold. This is a suggested heuristic
in Welling & Teh (2011) and is in line with the analysis in Sato & Nakagawa (2014) and
Vollmer et al. (2015).
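
The steps of Algorithm 2, together with the stepsize floor $\eta_0$, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the parameter is scalar, the toy logistic loss, the quadratic $r$, and all names (`dp_sgld`, `noise_variance`, `grad_ell`, `grad_r`) are hypothetical, and the leading constant of the privacy noise is taken as $128NTL^2/(\tau\epsilon^2)$.

```python
import math
import random


def noise_variance(c, eta):
    """Variance of Z_t: the larger of the privacy noise c*eta^2 and the Langevin noise eta."""
    return max(c * eta ** 2, eta)


def dp_sgld(data, grad_ell, grad_r, theta1, eps, delta, L, T, tau, step, rng):
    """Sketch of Algorithm 2 for a scalar parameter; returns all iterates."""
    N = len(data)
    # Coefficient of eta_t^2 in the privacy term of the variance (assumed constant).
    c = (128 * N * T * L ** 2 / (tau * eps ** 2)
         * math.log(2.5 * N * T / (tau * delta)) * math.log(2 / delta))
    theta, iterates = theta1, []
    for t in range(1, (N * T) // tau + 1):
        eta = step(t)
        batch = rng.sample(data, tau)                           # step 1: minibatch S
        z = rng.gauss(0.0, math.sqrt(noise_variance(c, eta)))   # step 2: noise Z_t
        grad = grad_r(theta) + (N / tau) * sum(grad_ell(x, theta) for x in batch)
        theta = theta - eta * grad + z                          # step 3: update
        iterates.append(theta)                                  # step 4: release iterate
    return iterates


def sigmoid(u):
    """Numerically stable logistic function."""
    return 1 / (1 + math.exp(-u)) if u >= 0 else math.exp(u) / (1 + math.exp(u))


# Toy problem (hypothetical): ell(x|theta) = log(1 + exp(-x * theta)) with |x| <= 1,
# so |grad_ell| <= 1 and L = 1; r(theta) = theta^2 / 2.
rng = random.Random(0)
data = [rng.uniform(0.0, 1.0) for _ in range(200)]
iterates = dp_sgld(
    data,
    grad_ell=lambda x, th: -x * sigmoid(-x * th),
    grad_r=lambda th: th,
    theta1=0.0, eps=1.0, delta=1e-5, L=1.0, T=2, tau=20,
    step=lambda t: max(1.0 / (t + 1), 1e-3),   # eta_t = max{1/(t+1), eta_0}
    rng=rng,
)
print(len(iterates), "iterates; last:", iterates[-1])
```

With these toy sizes the privacy term dominates the variance at every step, which is the large-noise regime discussed for the burn-in phase.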


Choice of T and $\tau$. By Bassily et al. (2014), it takes at least N data passes to converge in
expectation to a point near the minimizer, so taking T = 2N is a good choice. The variance
of both random components in our stochastic gradient is smaller when we use a larger $\tau$.
Smaller variances would improve the convergence of the stochastic gradient methods and
make SGLD a better approximation to the full Langevin Dynamics. The trade-off is
that when $\tau$ is too large, we will use up the allowable T data passes with just O(T) iterations
and the number of posterior samples we collect from the algorithm will be small.


Overcoming the large noise in the “Burn-in” phase. When the stepsize $\eta_t$ is not
small enough initially, we need to inject significantly more noise than what SGLD would
have to ensure privacy. We can overcome this problem by initializing the SGLD sampler
with a valid output of the OPS estimator, modified according to the exponential mechanism 
so that the privacy loss is calibrated to e/2. As the initial point is already in the high 
probability region of the posterior distribution, we no longer need to “Burn-in” the Monte 
Carlo sampler so we can simply choose a sufficiently small constant stepsize so that it 
remains a valid SGLD. This algorithm is summarized in Algorithm 3. 


Comparing to OPS. The privacy claim of DP-SGLD is very different from that of OPS. It does
not require sampling to be nearly correct to ensure differential privacy. In fact, DP-SGLD
privately releases the entire sequence of parameter updates, and thus ensures differential privacy
even if the internal state of the algorithm gets hacked. However, the quality of the samples
is usually worse than OPS due to the random-walk like behavior. The interesting fact,
however, is that if we run SGLD indefinitely without worrying about the stronger notion
of internal privacy, it leads to a valid posterior sample. We can potentially modify the
posterior distribution to sample from into the “scaled” version so as to balance the two
ways of getting privacy.


Algorithm 3 Hybrid Posterior Sampling Algorithm

Require: Data X of size N, log-likelihood function $\ell(\cdot|\theta)$ with Lipschitz constant L in the
first argument, a bound on $\sup_{x \in \mathcal{X}} \|x\|$, a prior $\pi$. Privacy requirement $\epsilon$.

1. Run OPS estimator: Algorithm 1 with $\epsilon/2$. Collect sample point $\theta_0$.

2. Run DP-SGLD (Algorithm 2) or other Stochastic Gradient Monte Carlo algorithms,
starting from $\theta_0$, and collect samples.

Output: Return all samples.


4.2 Hamiltonian Dynamics, Fisher Scoring and Nose-Hoover Thermostat

One of the practical drawbacks of SGLD is its random walk-like behavior, which slows down
the mixing significantly. In this section, we describe three extensions of SGLD that attempt
to resolve the issue by either using auxiliary variables to counter the noise in the stochastic
gradient (Chen et al., 2014; Ding et al., 2014), or exploiting second order information so as
to use Newton-like updates with large stepsize (Ahn et al., 2012).

We note that in all these methods, stochastic gradients are the only form of data access,
therefore results similar to those we described for SGLD follow nicely. We briefly describe
each method and how to choose their parameters for differential privacy.


Stochastic Gradient Hamiltonian Monte Carlo. According to Neal (2011), Langevin
Dynamics is a special limiting case of Hamiltonian Dynamics, where one can simply ignore
the “momentum” auxiliary variable. In its more general form, Hamiltonian Monte Carlo
(HMC) is able to generate proposals from distant states and hence enables more rapid
exploration of the state space. Chen et al. (2014) extend the full “leap-frog” method for
HMC in Neal (2011) to work with stochastic gradients and add a “friction” term in the
dynamics to “de-bias” the noise in the stochastic gradient:

$$\begin{cases} \theta_t = \theta_{t-1} + h_t p_{t-1} \\ p_t = p_{t-1} - h_t \hat{\nabla} - h_t A p_{t-1} + \mathcal{N}(0, 2(A - \hat{B}) h_t), \end{cases} \qquad (2)$$

where $\hat{B}$ is a guessed covariance of the stochastic gradient (the authors recommend restricting
$\hat{B}$ to a single number or a diagonal matrix) and A can be arbitrarily chosen as long as
$A \succeq \hat{B}$. If the stochastic gradient $\hat{\nabla} \sim \mathcal{N}(\nabla, B)$ for some B and $\hat{B} = B$, then this dynamics
simulates a dynamical system that yields the correct distribution. Note that even if the
normal assumption holds and we somehow set $\hat{B} = B$, we still require $h_t$ to go to 0 to
sample from the actual posterior distribution, and as $h_t$ converges to 0 the additional noise
we artificially inject dominates and we get privacy for free. All we need to do is to set A,
$\hat{B}$ and $h_t$ so that $2(A - \hat{B})/h_t \succeq \frac{128 N T L^2}{\tau \epsilon^2} \log\left(\frac{2.5NT}{\tau\delta}\right) \log(2/\delta) \, I_n$. Note that as $h_t \to 0$ this
quickly becomes true.
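
A single SGHMC update of the form (2) can be sketched for a scalar parameter as below. This is an illustration under assumptions: the function names are hypothetical, the stochastic gradient is evaluated at the updated position (the text does not pin this down), and `c` stands for whatever noise level the privacy condition requires.

```python
import math
import random


def sghmc_step(theta, p, grad_hat, h, A, B_hat, rng):
    """One scalar SGHMC update in the form of equation (2)."""
    noise_var = 2.0 * (A - B_hat) * h
    if noise_var < 0:
        raise ValueError("need A >= B_hat")
    theta_new = theta + h * p
    # Assumption: the stochastic gradient is evaluated at the updated position.
    p_new = p - h * grad_hat(theta_new) - h * A * p + rng.gauss(0.0, math.sqrt(noise_var))
    return theta_new, p_new


def injected_noise_suffices(h, A, B_hat, c):
    """Check 2 (A - B_hat) / h >= c, where c is the noise level privacy requires."""
    return 2.0 * (A - B_hat) / h >= c


# With A == B_hat no noise is injected, so the step is deterministic.
theta, p = sghmc_step(1.0, 0.0, lambda th: th, h=0.1, A=1.0, B_hat=1.0, rng=random.Random(0))
print(theta, p)
```

Because the left-hand side of the check grows like $1/h$, shrinking the discretization step eventually satisfies it for any fixed `c`, which is the sense in which privacy comes for free.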


Stochastic Gradient Nose-Hoover Thermostat. As we discussed, the key issue about
SGHMC is still in choosing $\hat{B}$. Unless $\hat{B}$ is chosen exactly as the covariance of the true stochastic
gradient, it does not sample from the correct distribution even as $h_t \to 0$ unless we trivially
set $\hat{B} = 0$. The Stochastic Gradient Nose-Hoover Thermostat (SGNHT) overcomes the
issue by introducing an additional auxiliary variable $\xi$, which serves as a thermostat to
absorb the unknown noise in the stochastic gradient. The update equations of SGNHT are
given below:

$$\begin{cases} p_t = p_{t-1} - \xi_{t-1} p_{t-1} h_t - \hat{\nabla} h_t + \mathcal{N}(0, 2 A h_t); \\ \theta_t = \theta_{t-1} + h_t p_{t-1}; \\ \xi_t = \xi_{t-1} + \left(\frac{1}{n} p_t^T p_t - 1\right) h_t. \end{cases} \qquad (3)$$

Similar to the case in SGHMC, an appropriately selected discretization parameter $h_t$ and
friction term A will imply differential privacy.
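
The thermostat update (3) can be sketched in the same style. Again this is only an illustration: the function name is hypothetical, $n$ is read as the dimension of $\theta$, and the position update uses $p_{t-1}$ as written in (3).

```python
import math
import random


def sgnht_step(theta, p, xi, grad_hat, h, A, rng):
    """One SGNHT update in the form of equation (3); theta and p are lists of floats."""
    n = len(theta)
    g = grad_hat(theta)
    sd = math.sqrt(2.0 * A * h)
    p_new = [p_i - xi * p_i * h - g_i * h + rng.gauss(0.0, sd) for p_i, g_i in zip(p, g)]
    theta_new = [th_i + h * p_i for th_i, p_i in zip(theta, p)]  # uses p_{t-1}
    # Thermostat: xi grows when the average kinetic energy exceeds 1 and shrinks otherwise.
    xi_new = xi + (sum(v * v for v in p_new) / n - 1.0) * h
    return theta_new, p_new, xi_new


# Deterministic check with A = 0 (no injected noise) and a zero gradient.
theta, p, xi = sgnht_step([0.0, 0.0], [1.0, -1.0], 0.5, lambda th: [0.0, 0.0],
                          h=0.1, A=0.0, rng=random.Random(0))
print(theta, p, xi)
```

Unlike SGHMC, no guess of the gradient noise covariance enters the update; the friction variable adapts instead, so only $h_t$ and A need to be set with privacy in mind.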

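Chen et al. (2014); Ding et al. (2014) both described a reformulation that can be interpreted
as SGD with momentum. This is by setting parameters $\eta = h^2$, $a = hA$, $b = h\hat{B}$ for
SGHMC:

$$\begin{cases} \theta_t = \theta_{t-1} + v_{t-1} \\ v_t = v_{t-1} - \eta_t \hat{\nabla} - a v_{t-1} + \mathcal{N}(0, 2(a - b) \eta_t I); \end{cases} \qquad (4)$$

and $v = ph$, $\eta_t = h_t^2$, $\alpha = \xi h$ and $a = Ah$ for SGNHT:

$$\begin{cases} v_t = v_{t-1} - \alpha_{t-1} v_{t-1} - \eta_t \hat{\nabla}(\theta_t) + \mathcal{N}(0, 2 a \eta_t I); \\ \theta_t = \theta_{t-1} + v_{t-1}; \\ \alpha_t = \alpha_{t-1} + \left(\frac{1}{n} v_t^T v_t - \eta_t\right), \end{cases} \qquad (5)$$

where $1 - a$ is the momentum parameter and $\eta$ is the learning rate in SGD with momentum.
Again note that to obtain privacy, the injected noise variance $2 a \eta_t$ must be at least the
privacy noise variance required in Algorithm 2, which amounts to a lower bound on $a/\eta_t$
that scales with $\frac{NTL^2}{\tau \epsilon^2}$ up to logarithmic factors in $NT/(\tau\delta)$ and $1/\delta$.

Note that as $\eta_t$ gets smaller, we have the flexibility of choosing a and $\eta_t$ within a reasonable
range.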
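
To see where the SGHMC reformulation comes from, one can multiply the momentum update in (2) by the stepsize and substitute $v = hp$. The short derivation below assumes a constant stepsize $h$ for readability (the text allows $h_t$ to vary), and uses the fact that scaling a Gaussian by $h$ scales its variance by $h^2$.

```latex
\begin{aligned}
\theta_t &= \theta_{t-1} + h\,p_{t-1} = \theta_{t-1} + v_{t-1},\\
v_t = h\,p_t &= h\,p_{t-1} - h^2\,\hat{\nabla} - hA\,(h\,p_{t-1}) + h\,\mathcal{N}\bigl(0,\, 2(A-\hat{B})h\bigr)\\
&= v_{t-1} - \eta\,\hat{\nabla} - a\,v_{t-1} + \mathcal{N}\bigl(0,\, 2(hA - h\hat{B})\,h^2\bigr)\\
&= v_{t-1} - \eta\,\hat{\nabla} - a\,v_{t-1} + \mathcal{N}\bigl(0,\, 2(a-b)\,\eta\bigr),
\qquad \eta = h^2,\; a = hA,\; b = h\hat{B}.
\end{aligned}
```

The same substitution applied to (3), with $\alpha = \xi h$, yields the SGNHT form.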


Stochastic Gradient Fisher Scoring. Another extension of SGLD is Stochastic Gradient
Fisher Scoring (SGFS), where Ahn et al. (2012) propose to adaptively interpolate
between a preconditioned SGLD (see preconditioning (Girolami & Calderhead, 2011)) and
a Markov Chain that samples from a normal approximation of the posterior distribution.
For parametric problems where the Bernstein-von Mises theorem holds, this may be a good
idea. The heuristic used in SGFS is that the covariance matrix of $\theta|X$, which is also
the inverse Fisher information, is estimated on the fly. The key feature of SGFS is
that one can use the stepsize to trade off speed and accuracy: when the stepsize is large,
it mixes rapidly to the normal approximation, and as the stepsize gets smaller the stationary
distribution converges to the true posterior. Further details of SGFS and ideas to privatize
it are described in the appendix.

4.3 Discussions and caveats.


So far, we have proposed a differentially private Bayesian learning algorithm that is memory 
efficient, statistically near optimal for a large class of problems, and we can release many 
intermediate iterates to construct error bars. Given that differential privacy is usually very 
restrictive, some of these results may appear too good to be true. This is a reasonable 
suspicion due to the following caveats. 


Small $\eta$ helps both privacy and accuracy. It is true that as $\eta$ goes to 0, the stationary
distribution that these methods sample from gets closer to the target distribution. On the
other hand, since the variance of the noise we need to add for privacy scales like $O(\eta^2)$ and
that for posterior sampling scales like $O(\eta)$, privacy and accuracy benefit from the same
underlying principle. The caveat is that we also have a budget on how many samples we
can collect. Also, the smaller the stepsize $\eta$ is, the slower it mixes; as a result, the samples
we collect from the Monte Carlo sampler are going to be more correlated with each other.
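
The scaling argument can be made explicit with a one-line comparison. Here $c$ is an assumed shorthand for the problem-dependent coefficient of $\eta^2$ in the privacy noise variance; it is not a symbol used elsewhere in the text.

```latex
\frac{\text{privacy noise variance}}{\text{sampling noise variance}}
  = \frac{c\,\eta^{2}}{\eta} = c\,\eta \;\longrightarrow\; 0 \quad (\eta \to 0),
\qquad
\max\{c\,\eta^{2},\, \eta\} = \eta \iff \eta \le 1/c .
```

So once the stepsize falls below $1/c$, the noise needed for posterior sampling already covers the noise needed for privacy.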


Adaptivity of SGNHT. While SGNHT is able to adaptively adjust the temperature so
that the samples that it produces remain “unbiased” in some sense as $\eta \to 0$, the reality
is that if the level of noise is too large, either we adjust the stepsize to be too small to
search the space at all, or the underlying stochastic differential equation becomes unstable
and quickly diverges. As a result, the adaptivity of SGNHT breaks down if the privacy
parameter gets too small.


Computational efficiency. For a large problem, it is usually the case that we would
like to train with only one pass of data or a very small number of data passes. However, due
to the condition in Lemma 13, our result does not apply to one pass of data unless $\tau$ is
chosen to be as large as N. While we can still choose T to be sufficiently large and stop
early, the amount of noise that we add in each iteration will remain the same.


The Curse of Numerical constant. The analysis of algorithms often involves large
numerical constants and polylogarithmic terms in the bound. In learning algorithms
these are often fine because there are more direct ways to evaluate and compare methods’
performances. In differential privacy however, constants do matter. This is because we need
to use these bounds (including constants) to decide how much noise or perturbation we need
to inject to ensure a certain degree of privacy. These guarantees are often very conservative,
but it is intractable to empirically evaluate the actual $\epsilon$ of differential privacy due to its
“worst” case definition. Our stochastic gradient based differentially private sampler suffers
from exactly that. For moderate data size, the product of the constant and logarithmic
terms can be as large as a few thousand. That is the reason why it does not perform as
well as other methods despite being theoretically optimal in scaling (the optimality
result is due to SGD (Bassily et al., 2014)).


Figure 1: Illustration of stochastic gradient Langevin dynamics and its private counterpart
at $\epsilon = 10$.


5 Experiments 

Figure 1 is a plain illustration of how these stochastic gradient samplers work using a
randomly generated linear regression model (note that its posterior distribution will be
normal, as the contour illustrates). On the left, it shows how these methods converge
like stochastic gradient descent to the basin of convergence. Then it becomes a posterior
sampler. The figure on the right shows that the stochastic gradient thermostat is able to
produce more accurate/unbiased results and the impact of differential privacy at the level of
$\epsilon = 10$ becomes negligible.

To evaluate how our proposed methods work in practice, we selected two binary classification
datasets: Abalone and Adult, from the first page of UCI Machine Learning Repository
and performed privacy constrained logistic regression on them. Specifically, we compared
two of our proposed methods, the OPS mechanism and the hybrid algorithm, against the
state-of-the-art empirical risk minimization algorithm ObjPert (Chaudhuri et al., 2011; Kifer
et al., 2012) under varying levels of differential privacy protection. The results are shown
in Figure 2. As we can see from the figure, in both problems, OPS significantly improves
the classification accuracy over ObjPert. The hybrid algorithm also works reasonably
well, given that it collected N samples after initializing it from the output of a run of OPS
with privacy parameter $\epsilon/2$. For fairness, we used the $(\epsilon, \delta)$-DP version of the objective
perturbation (Kifer et al., 2012) and similarly we used the Gaussian mechanism (rather than
the Laplace mechanism) for output perturbation. All optimization based methods are solved
using the BFGS algorithm to high numerical accuracy. OPS is implemented using SGNHT and
we ran it long enough so that we are confident that it is a valid posterior sample. Minibatch
size and number of data passes in the hybrid DP-SGNHT are both chosen to be $\sqrt{N}$.

Figure 2: Comparison of Differential Private methods. (a) Synthetic: classification of two
normals. (b) Abalone: 9 features, 4177 data points. (c) Adult: 109 features, 32561 data
points. Horizontal axis: privacy loss $\epsilon$ in log-scale.

We note that the plain DP-SGLD and DP-SGNHT without an initialization using OPS
do not work nearly as well. In our experiments, they often perform equally or slightly
worse than the output perturbation. This is due to the few caveats (especially “the curse of
numerical constant”) we described earlier.

6 Related work


We briefly discuss related work here. For the first part, we became aware recently that Mir
(2013) and Dimitrakakis et al. (2014) independently developed the idea of using posterior
sampling for differential privacy. Mir (2013, Chapter 5) used a probabilistic bound of the
log-likelihood to get $(\epsilon, \delta)$-DP but focused mostly on conjugate priors where the posterior
distribution is in closed-form. Dimitrakakis et al. (2014) used a Lipschitz assumption and
bounded data points (which implies our boundedness assumption) to obtain a generalized notion
of differential privacy. Our results are different in that we also studied the statistical and
computational properties. Bassily et al. (2014) used the exponential mechanism for empirical
risk minimization and the procedure is exactly the same as OPS. Our difference is to connect
it to Bayesian learning and to provide results on limiting distribution, statistical efficiency
and approximate sampling. We are not aware of a similar asymptotic distribution with
the exception of Smith (2008), where a different algorithm (the subsample-and-aggregate
scheme) is proven to give an estimator that is asymptotically normal and efficient (therefore,
stronger than our result) under a different set of assumptions. Specifically, the method of
Smith (2008) requires boundedness of the parameter space while our method can work with
potentially unbounded space so long as the log-likelihood is bounded.

Related to the general topic, Kasiviswanathan & Smith (2014) explicitly modeled the
“semantics” of differential privacy from a Bayesian point of view, and Xiao & Xiong (2012)
developed a set of tools for performing Bayesian inference under differential privacy, e.g.,
conditional probability and credibility intervals. Williams & McSherry (2010) studied
a related but completely different problem that uses posterior inference as a meta-post-processing
procedure, which aims at “denoising” the privately obfuscated data when the
private mechanism is known. Integrating Williams & McSherry (2010) with our procedure
might lead to some further performance boost, but investigating its effect is beyond the
scope of the current paper.

For the second part, the idea to privately release stochastic gradients has been well-studied.
Song et al. (2013); Bassily et al. (2014) explicitly used it for differentially private stochastic
gradient descent, and Rajkumar & Agarwal (2012) used it for private multi-party training.
Our Theorem 16 is a simple modification of Theorem 2.1 in Bassily et al. (2014). Bassily
et al. (2014) also showed that the differentially private SGD using the Gaussian mechanism with
$\tau = 1$ matches the lower bound up to constant and logarithmic factors, so we are confident that not
many algorithms can do significantly better than Algorithm 2. Our contribution is to point
out the interesting algorithmic structures of SGLD and its extensions that preserve differential
privacy. The method in Song et al. (2013) requires disjoint minibatches in every data pass,
and it requires adding significantly more noise in settings where Lemma 13 applies. The method of
Song et al. (2013) is however applicable when we are doing only a small number of data passes,
and for these cases it gets a much better constant. The setting of Rajkumar & Agarwal (2012)
is completely different as it injects a fixed amount of noise to the gradient corresponding to
each data point exactly once. In this way, it replicates objective perturbation (Chaudhuri
et al., 2011) (assuming the method actually finds the optimal solution).

Objective perturbation was originally proposed in Chaudhuri et al. (2011) and the $(\epsilon, \delta)$ version
that we refer to first appears in Kifer et al. (2012). Compared to our two mechanisms that
attempt to sample from the posterior, their privacy guarantee requires the solution to be
exact while ours does not. In comparison, the OPS estimator is differentially private while
allowing the distribution it samples from to be approximate; DP-SGLD, on the other hand,
releases all intermediate results and every single iteration is public.


7 Conclusion and future work 

In this paper, we described two simple but conceptually interesting examples showing that Bayesian
learning can be inherently differentially private. Specifically, we show that getting one
sample from the posterior is a special case of the exponential mechanism and this sample as an
estimator is near-optimal for parametric learning. On the other hand, we illustrate that
the algorithmic procedures of stochastic gradient Langevin Dynamics (and variants) that
attempt to sample from the posterior also guarantee differential privacy as a byproduct.
Preliminary experiments suggest that the One-Posterior-Sample mechanism works very well
in practice and it substantially outperforms earlier privacy mechanisms in logistic regression.
While suffering from a large constant, our second method is also theoretically and practically
meaningful in that it provides privacy protection in intermediate steps.

To carry the research forward, we think it is important to identify other cases when the 
existing randomness can be exploited for privacy. Randomized algorithms such as hashing 
and sketching, dropout and other randomization used in neural networks might be another 
thing to look at. More on the application end, we hope to explore the one-posterior sample 
approach in differentially private movie recommendation. Ultimately, the goal is to make 
differential privacy more practical to the extent that it can truly solve the real-life privacy 
problems that motivated its very advent. 


A Stochastic Gradient Fisher Scoring 

A.1 Fisher Scoring and Stochastic Gradient Fisher Scoring

Fisher scoring is simply Newton’s method for solving the maximum likelihood estimation
problem. The score function $S(\theta)$ is the gradient of the log-likelihood. So intuitively, if
we solve the equation $S(\theta) = 0$, we can obtain the maximum likelihood estimate. Often
this equation is highly non-linear, so we consider an iterative update for the linearized
score function (or a quadratic approximation of the likelihood) by Taylor expanding it at the
current point $\theta_0$:

$$S(\theta) \approx S(\theta_0) - I(\theta_0)(\theta - \theta_0),$$

where $I(\theta_0) = -\sum_{i=1}^{N} \nabla^2 \ell(x_i|\theta_0)$ is the observed Fisher information evaluated at $\theta_0$.

By the fact that $S(\theta^*) = 0$, plugging into the above equation gives $\theta^* = \theta_0 + I^{-1}(\theta_0) S(\theta_0)$.
Note that this is a fixed point iteration and it gives us an iterative update rule to search for
$\theta^*$ via

$$\theta_{k+1} = \theta_k + I^{-1}(\theta_k) S(\theta_k).$$

Recall that $S$ is the gradient of the log-likelihood, and $I$ is the covariance of the score
function and (under mild regularity conditions) the negative Hessian of the log-likelihood. As a
result, this is often the same as Newton iterations.

An intuitive idea to avoid passing over the entire dataset in every iteration is to simply replace
the gradient (the score function) with a stochastic gradient and somehow estimate the Fisher
information. Stochastic Gradient Fisher Scoring can be thought of as a Quasi-Newton
method.
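
As a concrete instance of the update rule, consider estimating the rate of an exponential distribution, where the score and the observed Fisher information have closed forms. The model choice and the function name below are illustrative assumptions; the example only shows the (non-stochastic) Fisher scoring iteration.

```python
def fisher_scoring_exponential(xs, lam0, n_iter=20):
    """Fisher scoring for the rate of an exponential model (illustrative example)."""
    n, total = len(xs), sum(xs)
    lam = lam0
    for _ in range(n_iter):
        score = n / lam - total      # S(lam): gradient of the log-likelihood
        info = n / lam ** 2          # I(lam): minus the second derivative
        lam = lam + score / info     # lam_{k+1} = lam_k + I^{-1}(lam_k) S(lam_k)
    return lam


xs = [0.5, 1.5, 1.0, 2.0, 1.0]
lam_hat = fisher_scoring_exponential(xs, lam0=0.5)
print(lam_hat, len(xs) / sum(xs))  # iterate vs. closed-form MLE n / sum(x)
```

For this model the iteration converges only from a starting point below twice the maximum likelihood estimate, which illustrates why Newton-type updates need a reasonable initialization.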


A.2 Privacy extension 

By invoking a more advanced version of the Gaussian Mechanism, we will show that a similar
privacy guarantee can be obtained for a modified version of SGFS (described in Algorithm 4)
while preserving its asymptotic properties. Specifically, under the assumption that $\hat{I}_t$ is
given, when $\eta_t$ is big, it also samples from a normal approximation (with larger variance);
when $\eta_t$ is small, the private algorithm becomes exactly the same as SGFS. Moreover, for
a sequence of samples from the posterior, the online estimate of the Fisher Information
converges to an $O(1/N)$ approximation of the true Fisher Information as in Ahn et al. (2012,
Theorem 1).

The privacy result relies on a more specific smoothness assumption. Assume that for
any parameter $\theta \in \mathbb{R}^d$ and $X \in \mathcal{X}^N$, the ellipsoid $E = F\mathbb{B}^d$ defined by transforming
the unit ball using a linear map F contains the symmetric polytope spanned by
$\{\pm\nabla \ell(x_1, \theta), \ldots, \pm\nabla \ell(x_N, \theta)\}$. From a differential privacy point of view, this implies that
the sensitivity of $\nabla \ell(x, \theta)$ is different in different directions. Then the non-spherical
Gaussian mechanism states:

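Lemma 17 (Non-Spherical Gaussian Mechanism). Output $\sum_{i=1}^{N} \nabla \ell(x_i, \theta) + Fw$ where
$w \sim \mathcal{N}(0, [\ldots])$ obeys $(\epsilon, \delta)$-DP.

Algorithm 4 Differentially Private Stochastic Gradient Fisher Scoring (DP-SGFS)
Require: Data X of size N, size of minibatch $\tau$, number of data passes T, stepsize $\eta_t$
for $t = 1, \ldots, \lfloor NT/\tau \rfloor$, a public Lipschitz matrix F, and initial $\theta_1$. Set t = 1,
$\sigma^2 = \frac{32 T \log(2.5NT/(\tau\delta)) \log(2/\delta)}{N \tau \epsilon^2}$.

for $t = 1 : \lfloor NT/\tau \rfloor$ do

1. Random sample a minibatch $S \subset [N]$ of size $\tau$, compute $\bar{g} = \frac{1}{\tau} \sum_{i \in S} \nabla \ell(x_i|\theta_t)$.

2. Sample $Z_t \sim \mathcal{N}(0, \sigma^2 I_d)$, $W_{ij} \sim \mathcal{N}(0, 49 \|F\|^4 \sigma^2)$.

3. Compute private stochastic gradient and sample covariance matrix:
$\tilde{g} = \bar{g} + F Z_t$, and $V = [\ldots]$ (the sample covariance of $\{\nabla \ell(x_i|\theta_t)\}_{i \in S}$ about $\bar{g}$, perturbed by $W$).

4. Update the guessed Fisher Information estimate $\hat{I}_t = (1 - \kappa_t) \hat{I}_{t-1} + \kappa_t V$.

5. Update and return $\theta_{t+1} \leftarrow \theta_t + 2 [\ldots]^{-1} (\nabla r(\theta_t) + N \tilde{g})$.

6. Increment $t \leftarrow t + 1$.

end for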



Theorem 18. Let F be such that $\ell(x; \theta') \le \ell(x; \theta) + \nabla \ell(x; \theta)^T (\theta' - \theta) + \frac{1}{2} \|F(\theta' - \theta)\|^2$ for any
$x \in \mathcal{X}$, $\theta \in \Theta$. Moreover, let $\epsilon, \delta, \tau, T$ be chosen such that $T \ge \frac{\epsilon^2 N}{32 \tau \log(2/\delta)}$. Then Algorithm 4
guarantees $(2\epsilon, 2\delta)$-differential privacy.

Proof. First of all, $\|F\|_2$ is an upper bound for any $\|\nabla \ell(x|\theta)\|$, so by applying Lemma 19 to
every set of subsamples in each iteration, by the Gaussian mechanism (Lemma 14) and the
invariance to post-processing, we know that $V$ is a private release. Then the proof follows by
the same line of argument (subsampling and advanced composition) as in Theorem 16 for $\tilde{g}$
and $V$ respectively, and the result follows by applying the simple composition theorem. ■

Lemma 19 (Sensitivity of the sample covariance operator). Let $\|x\| \le L$ for any $x \in \mathcal{X}$,
$n > 4$, then

$$\sup \left\| \mathrm{Cov}(x_1, \ldots, x_k, \ldots, x_n) - \mathrm{Cov}(x_1, \ldots, x_k', \ldots, x_n) \right\|_F \le \frac{7L^2}{n-1}.$$

Proof. We prove by taking the difference of two adjacent covariance matrices and bounding
the residual. Let $X'$ be obtained from $X$ by replacing the point $x'$ with $x$, and let $\mu$ be the
sample mean of $X'$. Then

$$\mathrm{Cov}(X') = \mathrm{Cov}(X) + \frac{1}{n-1}\left(x x^T - x' x'^T\right) + \frac{1}{n(n-1)}\left(x x^T + x' x'^T - x' x^T - x x'^T\right) - \frac{1}{n-1} \mu (x - x')^T - \frac{1}{n-1} (x - x') \mu^T.$$

Now assume $n > 4$ and take the upper bound of every term; we get
$\Delta_2(\mathrm{Cov}(X)) \le \frac{7L^2}{n-1}$.